Mon. Not. R. Astron. Soc. 000.H1I2T1(2003) Printed 2 Febraary 2008 (MN L2TpX style file v2.2) 



Maximum-entropy image reconstruction using wavelets 



Klaus Maisinger, M.P. Hobson and A.N. Lasenby 

Astrophysics Group, Cavendish Laboratory, Madingley Road, Cambridge, CB3 OHE, UK 



m 
O 
o 



> 
(N 

m 
O 
m 
o 

Oh 
6 



13 



Accepted — . Received — ; in original form 9th March 2003 



ABSTRACT 

Wavelet functions allow the sparse and efficient representation of a signal at different scales. 
Recently the application of wavelets to the denoising of maps of cosmic microwave back- 
ground (CMB) fluctuations has been proposed. The maximum-entropy method (MEM) is 
also often used for enhancing astronomical images and has been applied to CMB data. In this 
paper, we give a systematic discussion of combining these two approaches by the use of the 
MEM in wavelet bases for the denoising and deconvolution of CMB maps and more general 
images. Certain types of wavelet transforms, such as the a trous transform, can be viewed as 
a multi-channel intrinsic correlation function (ICF). We find that the wavelet MEM has lower 
reconstruction residuals than conventional pixel-basis MEM in the case when the signal-to- 
noise ratio is low and the point spread function narrow. Furthermore, the Bayesian evidence 
for the wavelet MEM reconstructions is generally higher for a wide range of images. From a 
Bayesian point of view, the wavelet basis thus provides a better model of the image. 

Key words: methods: data analysis - methods: statistical - techniques: image processing 



1 INTRODUCTION 

Both the maximum-entropy method (MEM) and wavelet tech- 
niques are used for astronomical image enhancement. In particular, 
both methods have recently been applied to the analysis of CMB 
data (see, for instance, Hobson, lones & Lasenbv 1999; Sa nz et alJ 
Il999alri lTenorioetalJll999l) . Maps of CMB anisotropies are a 
useful tool in the analysis of CMB data. Making maps is rarely 
straightforward, since a multitude of systematic instrumental ef- 
fects, calibration uncertainties and other deficiencies in the mod- 
elling of the telescope come into play. For example, interferometric 
maps suffer from the telescope's incomplete sampling in Fourier 
space and require the deconvolution of the synthesised beam (e.g. 
Thompson, Mor an & Swensor]ll994) and the suppression of re- 
ceiver noise. CMB observations from single-dish telescopes use 
total power measurements and scan across the observed fields to 
assemble a map. Here it is the effect of the finite primary beam that 
needs to be deconvolved in order that a high-resolution map may be 
recovered. Beyond the area of the CMB, the task of image recon- 
struction is generic and occurs in virtually any type of astronomical 
map-making. 

In a general imaging problem, we assume that the data d ob- 
served by an experiment are given by a convolution of the true sky 
signal, or image h, with the point spread function P of the instru- 
ment, plus some Gaussian random noise n: 

d =P*h+n. 

In the discretised version, the data vector d is given by a multi- 
plication of the vector h of the image pixels with the instrumen- 
tal response matrix R that describes the convolution with the point 



spread function, and the additive noise vector n: 
d = Rh +n. 

To solve the inverse problem of recovering the image h from the 
data, some type of regularisation is usually required. A common 
technique is to use an entropic function 5 for the regularisation. 
The best reconstruction is then found by minimising the func- 
tion F(h) = jX 2 (h) ~ oS(h) that determines a suitable trade-off 
between a good fit to the data enforced by the % 2 -statistic and a 
strong regularisation given by the entropy S(h) of the reconstruc- 
tion. The maximum-entropy method has proven to be very success- 
ful for the deconvolution of noisy images. 

Despite its capabilities, the MEM suffers from several short- 
comings. For example, the appropriate entropy functional depends 
on the properties of the distribution of image pixels, but it is not 
always evident what the theoretical distribution should be. For pos- 
itive additive distributions, one uses the entropy 



S(h)- 



N h 



- h log ■ 



(1) 



where the sum is over all image pixels and m; is a measure assigned 
to pixel ('. Even in this case problems can arise when there is no 
appropriate background level, or the image brightness falls below 
the background level in some areas. 

Another defect of the MEM is that, in its simplest forms, cor- 
relations between image pixels are not taken into account properly. 
This problem manifests itself in several guises. Because of corre- 
lations between image pixels, the effective number of 'degrees of 
freedom' in the data is often much smaller than the number of pa- 
rameters in the minimisation problem, making effective regulari- 
sation more difficult. In fact, MEM is inherently based on the as- 
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sumption that image pixels are independent. Furthermore, ignor- 
ing correlations leads to the introduction of spurious features in the 
map, such as the characteristic ringing present on uniform back- 
grounds. There is no provision in the MEM algorithm to reward 
local smoothness of the image. It appears to be quite difficult to 
regularise in such a way as to reconstruct faithfully sharp features 
and uniform areas at the same time. 

Several solutions have been propo sed to remedy the problem 
of image correlations. iGull & Skillina ll999l) have introduced the 
concept of an intrinsic correlation function (ICF) that is used to 
decorrelate the reconstructed image. The ICF framework has been 
extended to allow reconstructions of objects on different scales. 
IWeiJ fl992) proposes a multi-channel approach, which allows for 
multiple scales of pixel-to-pixel correlations. In pyramidal max- 
imum entropy iBontekoe. Koper & Kesterl [l994). the number of 
pixels retained in the low-resolution channels is decimated. Despite 
these improvements, choosing an ICF is not straightforward. It is 
clear that there is no single set of ICFs that is universally optimal 
for all possible types of data. Choosing suitable scale lengths and 
weights is of great importance. 

A slightly different approach to tackling the correlation prob- 
lem is to use a representation of the image that is more efficient 
in identifying its information content. In other words, the task is 
to find an optimal basis set for the representation of the image. 
Furthermore, it is desirable to have a representation that can ef- 
ficiently capture information present on different length scales in 
the image. For instance, for CMB observations several foreground 
components, such as radio point sources or SZ-clusters, are very 
localised on the sky. Some theories for structure formation also 
predict localised non-Gaussian imprints at arcminute scales on the 
CMB itself, for example temperature fluctuations produced in the 
wake of cosmic strings. On the other hand, the primordial CMB it- 
self shows more diffuse structure that peaks on angular scales close 
to a degree. Representing these signals in real space (i.e. the image 
plane) requires large numbers of basis functions (i.e. the pixels) 
for a given image. Similarly, a reconstruction in Fourier space re- 
quires the determination of a large number of modes, which are 
often very poorly constrained by the data, since each localised fea- 
ture on a map is expanded into an infinite number of basis func- 
tions in Fourier space. Maximum entropy has been applied to re- 
constructions in both real and Fourier space. A F ourier space ap- 
proach has been developed by HobsorT e t alj jl9 98[) and has been 
applie d to simulated Planck data, while ljones. Hobson & Lasenbvl 
( 1999) simulate its use for the MAP satellite mission. 

The application of wavelets to CMB data an alysis has re cently 
been investigated (see, for instance,|Hob^on_glalJl9^^|Sanz_elay 
ll999ibtlTenorio et alJl999llCav6n et alfcOOdlVielva et alJ200ll) . 
Wavelets are special sets of functions that allow the efficient rep- 
resentation of signals both in real and in Fourier space. Further- 
more, they can represent different objects of greatly varying sizes 
simultaneously. The term 'wavelet' does not refer to a single unique 
function. Instead, it comprises a whole class of functions with sim- 
ilar properties. In the context of CMB analysis, wavelets have pre- 
dominantly been used for noise filtering or the separation of lo- 
calised foreground sources. A combination of ME M and certain 
typ es of wavele t s has been discussed by IPantin & Starckl 1 1996) 
and lStarck etall feOOll) . 

In this paper, we investigate the use of wavelets in MEM more 
closely. We will compare different wavelet transforms, entropic pri- 
ors and regularisations. We will also consider how the different ap- 
proaches can be viewed in the ICF framework. In Section [2] we 
give an introduction to wavelet transforms, and the application of 



wavelet filters to denoising is reviewed in Section [3] In Section |4] 
the maximum-entropy method is introduced. The combination of 
wavelet and MEM techniques is explored in Section [5] and the 
equivalence of intrinsic correlation functions and certain redundant 
wavelet transforms is discussed in Section|6| In Section^ we test 
the techniques presented by applying them to simulated image re- 
construction problems. 



2 THE WAVELET TRANSFORM 

Wavelets are functions that enjoy certain properties, which will be 
further discussed below. A wavelet basis can be constructed from 
dilations and translations of a given wavelet function. The wavelet 
transform is an integral transform that uses the wavelet basis func- 
tions. The most widely used classes of wavelet transforms are or- 
thogonal transforms. Like the Fourier transform, they are essen- 
tially rotations in function space from the pixel basis to appropriate 
wavelet basis functions. Unlike Fourier transforms, the basis func- 
tions can be well-localised in both real and Fourier space (even 
though they can only be compactly supported in one of the two). 



2.1 The continuous wavelet transform 

The continuous wavelet transform of a one-dimensional square in- 
tegrable function / £ L 1 (R) can be defined as 

W(a,b) = r f( x )-L W (—) dT: 

J-oo ^a \ a J 

where the function \|/(x) is the wavelet (often called analysing 
wavelet or mother wavelet). The real numbers a > and b are 
scale and position parameters respectively. Dilations and transla- 
tions of V|f(x) can be derived by varying a and b. Obviously, the 
wavelet transform is linear. The inverse wavelet transform is given 
by 




where the normalisation constant Cy = 2iz \k\~ l \\\r(k)\ 2 dk can 
be obtained from the Fourier transform y(k) of the wavelet func- 
tion \|/(x) 

In order to construct a basis from the translations and dila- 
tions of the wavelet function l(/(x), a second function <])(x) must be 
introduced. It is called the scaling function or father wavelet, and 
its relation to \|/(x) will become clear in a moment. For the result- 
ing basis to be discrete, compact and orthogonal, the wavelet func- 
tions \|/(x) and (])(x) must obey a set of ma thematical restrictions 
first derived by Daubechies (e.g. IPaubechiesI 1 992b . among them 

J (|)(x)dx=l, J \|f(x) dx = 0, (2) 

and the normalisation 

£j V | 2 (x)dx=l. (3) 

A wavelet basis can then be constructed as follows. It is convenient 
to take special values for a and b in defining the basis: a = 2~ J and 

' The normalisation of assumes a Fourier transform that is scaled by 
1 / i/27l in both directions. 
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(a) (b) 

Figure 1. Wavelet functions l|/(x): (a) Haar wavelet and (b) Daubechies-4 
wavelet. 

b = 2 J l, where j and / are integers labelling the scale of the wavelet 
and the position at this scale. The resulting wavelet functions are 

= 2^(2^-0, 

y Jtl (x) = Zi\\f(2j x -l). 

It can be shown that the set {§oi:Vj,l} with j ^ and — °° < / < 
ao forms a complete orthonormal basis in L 2 (R). One may then 
expand a function f(x) as 

/(*)= £ co^/W+E E "OVY/vM. (4) 

l=-oo y=0Z=-oo 

where the wavelet coefficients cq i and w,y are given by 

w jy = / /(x)\|/ jV (x)dx. (5) 

Since the wavelet basis consists of dilations and translations 
of the mother and father wavelets \|/(x) and <j)(x), one can ob- 
tain a different orthogonal wavelet basis for each pair of these 
functions that obey the Daubechies conditions. Thus there exists 
an infinite number of possible wavelet transforms, with different 
wavelet bases making different trade-offs between how compactly 
they are localised in either real space or frequency space. Unfor- 
tunately, in nearly all cases, the wavelet basis functions cannot be 
expressed in closed form in terms of simple functions. Neverthe- 
less, two of the more commonly used wavelet basis functions, the 
Haar and Daubechies wavelets, are plotted in Fig.Q The proper- 
ties of wavelet and scaling functions can be more easily understood 
within the framework of multiresolution analysis (MRA), which is 
discussed in more detail in Appendix IaI 

2.2 The discrete wavelet transform 

In many applications, one is not so much interested in a func- 
tion f(x) defined at all values of x, as in its samples /(x,) at 
N — 2 J equally-spaced points x,-. By analogy with the discrete 
Fourier Transform (DFT), this function can be represented by its 
J detail scales and its smooth component. Equation becomes 

y-i2'-i 

/(*;) = co,o<j>o,o(*i) + E E w j,lVj,i( x i)> ( 6 ) 
j=o 1=0 

and the integrals {3} for the wavelet coefficients have to be replaced 
by the appropriate summations. If the mean of the function samples 
f(xi) is zero, then cq o = and the function can be described en- 
tirely in terms of the wavelets As the scale index j increases 
from to J — 1, the wavelets represent structure of the function on 



increasingly smaller scales, where each scale is by a factor of 2 
more detailed than the previous one. The index / (which runs from 
/ = to 2 J — 1) denotes the position of the wavelet j within the 
jth scale level. 

If we collect the function samples /,- = /(x,) in a column vec- 
tor / (whose length N must an integer power of 2), then the DWT 
(like the DFT) is a linear operation that transforms / into another 
vector / of the same length, which contains the wavelet coefficients 
of the (digitised) function. The action of the DWT can therefore be 
described as a multiplication of the original vector by the N x N 
wavelet matrix W: 

7 = W/. (7) 

Again like the DFT, the matrix W is orthogonal, and the inverse 
transformation can be performed straightforwardly using the trans- 
pose of W. Thus both the DFT and DWT can be considered as 
rotations from the original orthonormal basis vectors e, in signal 
space to some new orthonormal basis S, (i = 1,...,N), with the 
transformed vector/ containing the coefficients in this new basis. 

The original basis vectors e,- have unity as the rth element and 
the remaining elements equal to zero, and hence correspond to the 
'pixels' in the original vector /. Therefore the original basis is the 
most localised basis possible in real space. For the DFT, the new 
basis vectors e,- are (digitised) complex exponentials and represent 
the opposite extreme, since they are completely non-local in real 
space but localised in frequency space. For the DWT, the new ba- 
sis vectors are the wavelets, which enjoy the characteristic prop- 
erty of being fairly localised both in real space and in frequency 
space, thus occupying an intermediate position between the origi- 
nal 'delta-function' basis and the Fourier basis of complex expo- 
nentials. Indeed, it is the simultaneous localisation of wavelets in 
both spaces that makes the DWT such a useful tool for analysing 
data in wide range of applications. 



2.3 The a trous transform 

So far we have restricted our attention to discrete wavelet bases that 
are orthogonal and non-redundant (i.e. the number of wavelet coef- 
ficients equals the number of points at which the original function 
is sampled). By relaxing some of the Daubechies conditions J2j 
and 0, one may represent a sampled function in terms of wavelet 
bases that are both non-orthogonal and redundant. Although this 
may a t first seem a retrograde step, lLanger. Wilson & Andersonl 
1 1993) have suggested that non-orthogonal, translationally and ro- 
tationally invariant (isotropic) wavelet transforms are better suited 
to the task of image reconstruction than orthogonal on es. 

The a trous ('with holes'') algo rithm fflolschneid er et all 
Il989t iBiiaoui. Starck & Murtaghl fl994l) is an example of a non- 
orthogonal, redundant discrete wavelet transform that is widely 
used in image analysis. Starting from a data vector cji (I = 1 , . . . , N, 
2 J ^ N) iteratively smoothed vectors are obtained by 

c j-\,k = Y* HlC j.k+2V-'H' ( 8 ) 
I 

where each step is effectively a convolution of the image with 
the filter mask H\ using varying step sizes 2 J ~ J . At each scale, 
the detail wavelet coefficients contain the difference between the 
smoothed image c,_i ^ and the image c;^ at the previous scale: 

Since no decimation is carried out between consecutive filter steps, 
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Figure 2. The effective point spread functions of the a trous transform for 
the triangular scaling function 1 1 01 . The wavelet coefficients at each scale 
are given by convolutions with the increasingly wider wavelet functions 
(dotted line, dash-dotted line, dashed line). The smooth component is given 
by a convolution with the scaling function (solid line). 




Figure 3. The effective point spread functions of the a trous transform for a 
i?3-spline. The wavelet coefficients at each scale are given by convolutions 
with the increasingly wider wavelet functions (dotted line, dash-dotted line, 
dashed line). The smooth component is given by a convolution with the 
scaling function (solid line). Compare the triangular mask in Fig. [2] 



the a trous transform is redundant. Thus the final wavelet trans- 
formed vector has length J x N. The inverse a trous transform is 
simply the sum over the coefficients at all scales: 

/-1 

Cj.l = C(U + £ WjJ. 

1=0 

2.3.1 Properties of the a trous transform 

For the a trous transform the normalisation /|y(x)| 2 dx = 1 no 
longer holds. This means that some convenient properties of the 
orthogonal transform are lost. For instance, the orthogonal wavelet 
transforms of Gaussian white noise have a constant dispersion in 
all wavelet domains. This is not the case for the a trous transform. 

As discussed above, the a trous transform constructs the 
wavelet coefficients by successive applications of the same filter 
mask with different spacings between pixels, followed by a sub- 
traction of the smooth component in each step. This procedure is 
computationally efficient because of the compactness of the mask, 
as the wavelet coefficients in each domain can be constructed by 
a sum over only a small fraction of the coefficients on the previ- 
ous scale. Perhaps not entirely obvious from this construction is 
that this algorithm is equivalent to a convolution of the original im- 
age with a series of point spread functions, which do not only have 
different widths, but can also assume slightly different shapes on 
different scales. Wavelet coefficients are then given by the convo- 
lution 



'Y^fi-H-k c J,l 



(9) 



of the input vector and the wavelet function (— x) at the (j — 
l)th scale. For a symmetric wavelet \\fj_i(x) = x), this is 

identical to a convolution with the wavelet itself. Popular choices 
for the corresponding scaling function §(x) are the triangle function 



♦to 



if jc e [—1,1] 

otherwise 



(10) 



or a B3-spline IStarck. Murtagh & Biiaouill 998). 

The effective point spread functions for the triangular scal- 
ing function < 101 are shown in Fig.|2| The horizontal axis denotes 




Figure 4. The window functions for the triangular transform 1101 at differ- 
ent scales in Fourier space. These functions have been obtained by a Fourier 
transform of the functions in Fig.|2| The horizontal scale denotes spatial fre- 
quencies. 



pixel offsets. The first wavelet scale is given by a convolution with 
the narrow wavelet function \|/(x) (dotted line). The next scale is 
given by the same function, but scaled by a factor of 2 in width 
and j in height (dash-dotted line). The higher scales are produced 
from increasingly wider convolution masks. Finally, the last scale is 
the smooth component obtained from a convolution with the broad 
scaling function §(x) (solid line). 

Fig. shows the effective point spread functions for the 63- 
spline given by the convolution mask 

113 11 

16'4' 8'4'16, 

Note how the wavelet functions become smoother as their width 
increases relative to the pixel resolution. 

2.3.2 The a trous transform as a Fourier filter 

The convolution functions shown in Figs.[2]and[3]act as filters cor- 
responding to different spatial frequency bands. The corresponding 
window functions can be obtained from a Fourier transform of the 
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convolution masks. The window functions for the triangular trans- 
form from Fig. [2| are plotted in Fig.|4] The detail coefficients are 
given by the high-pass filter extending to the large Fourier modes 
(dotted line with circles). The coarser scales are given by filter func- 
tions moving to smaller and smaller Fourier modes. The smooth 
component can be obtained from the low-pass filter centred around 
the smallest frequencies (solid line), or longest wavelengths. The 
fact that the inverse transform is simply given by a sum of the de- 
tail scales and the smooth components ensures that all filters add up 
to a constant rectangular window of unit height (crosses on top of 
the diagram). The corresponding window functions for a B3 -spline 
look very similar, but their sidelobes are much smaller. 

2.4 Two-dimensional wavelet transforms 

The extension of the discrete wavelet transform to two dimen- 
sional objects or images is not unique. For orthogonal, non- 
redundant wavelet bases, there are two commonly used types of 
two-dimensional transform. The terminology used to name these 
transforms is quite variable, and different authors even use the 
same term to refer to different transforms. We choose to refer 
to them as 'tensor' and Mallat (or MRA) transforms respectively, 
even though both are constructed from tensor products of the 
one-dimensional wavelets. The tensor approach simply uses ten- 
sor products of the one -dimensional wavelets as a two-dimensional 
basis (e.g. Press et al. 1992). The MRA transform was developed 
by Mallat 1 1989) and uses dilated versions of three different tensor 
products of one-dimensional wavelets without mixing scales in dif- 
ferent directions. Both tensor and Mallat transforms are presented 
in more detail in Appendix IbI 

The non-orthogonal, redundant a trous transform lends itself 
particularly well to generalisations to higher dimensions. The con- 
volution mask for the two-dimensional a trous transform can be 
obtained from a two-dimensional rotation of the one-dimensional 
wavelet function. One of the most appealing features of the a trous 
algorithm is that it does not single out any special direction in the 
image plane, unlike the tensor products that are tailored to vertical, 
horizontal and diagonal image features. In this sense the a trous 
transform is isotropic. Furthermore, the a trous transform is also 
invariant under translations (the transform of the dilated function is 
simply the dilation of the transform). In practice, the convolution 
mask for the two-dimensional a trous transform can even be ob- 
tained from a simple product of two one-dimensional masks, while 
retaining the above properties to a sufficient approximation. 



3 WAVELET DENOISING 

The application of wavelets to the denoising of CMB maps has re- 
cently been proposed iTenorio et alJl999tlSanz et alll999Jbh . Fil- 
tering techniques exploit the fact that the information content of the 
wavelet-transformed image has been compressed into fewer coef- 
ficients. On the other hand, the dispersion of Gaussian white noise 
is constant across all wavelet domains after an orthogonal trans- 
form, because white noise has equal power on all image scales. 
Thus some wavelet coefficients become statistically more signif- 
icant than the original image pixels, since they are considerably 
enhanced compared to the noise. This observation leads to a filter- 
ing scheme where statistically significant coefficients are kept and 
the rest are discarded. The image obtained by an inverse transform 
then has much reduced noise. 

A simple realisation of such a filtering scheme uses 'hard 



thresholding' , where each wavelet coefficient h>; is multiplied by 
a factor 



Mi 



1 if > %i 
if \Wi\ < X; 



(11) 



For Gaussian noise, the threshold X; is usually chosen as some mul- 
tiple of the noise dispersion a^, i.e. X; = k<y{,, where th e £th coeffi- 
cient lies in the 7th wavelet domain. Tstarck et all ( 1998) propose to 
call the vector M from 1 1 1 i the multiresolution support. A common 
choice for the threshold is k = 3. 
For 'soft thresholding ' 



if wj ^ X/ 
if H<x,- 

if W; ^ X; 



(12) 



the value M,- can vary continuously. Coefficients that lie under the 
threshold are discarded, whereas the rest are shrunk by the values 
X/. Again, there are different prescriptions for choosing the x,-. One 
possibility is x,- = ka^ where k is of the or der of 1. A more ad- 
vance d method is the procedure SureShrink iDonoho & Johnstone! 
1995) which assigns the threshold x to each resolution level by min- 
imising the Stein Unbiased Estimate of Risk (SURE) for thresh- 
old estimates. This is basically a method to minimise the the es- 
timated mean squared error of the filtered coefficient. The com- 
putational cost of this procedure is of the order NlogN, where 
N is the number of coefficients in a domain. Additionally, one 
often sets a minimum x m j n and maximum x max threshold such 
that SureShrink is only applied in those wavelet domains j where 
^min ^ °d/°n ^ T max- This avoids damping of the signal for do- 
mains of high signal-to-noise and suppresses noise-dominated co- 
efficients. The soft thresholding filter is non-linear, since the values 
of the filter coefficients M, depend on the data w,. 

The multiresolution support as a mask for regularisation will 
be discussed in Section IB~2l In Section|7| we will use reconstruc- 
tions obtained from wavelet denoising as a benchmark for a com- 
parison with MEM techniques operating in the wavelet basis. 



4 THE MAXIMUM-ENTROPY METHOD 

The task of recovering the original image from blurred and noisy 
data is a typical inverse problem. This section discusses methods to 
draw inferences from data and to solve inverse problems. In partic- 
ular, we focus on the maximum-entropy method (MEM) of image 
reconstruction. We give here a discussion of the background to the 
standard MEM technique and then, in Section|5| highlight the en- 
hancements to the method provided by performing reconstructions 
in wavelet bases. 



4.1 The inverse problem 

In image reconstruction, the inverse problem consists of the task 
of estimating the N/, image pixels hi (i = 1 . . .Ni,) from the Nj data 
samples dj. One may find a solution by optimising som e measure o f 
the goodness of fit to the data (see, for example. lTitteringtonl 19851) . 
For Gaussian noise, the % 2 -statistic is the preferred measure: 



X 2 (h) = (Rh d) l N ^(Rh-d) 



(13) 



By minimising % , the vector h that best fits the data can be found. 
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4.2 Regularisation 

In image reconstructions, the number Nf, of parameters is the num- 
ber of image pixels, which can be very large. The parameter es- 
timates are poorly constrained by the data, and over-fitting leads 
to a bad reconstruction of the original image. In order to avoid 
over-fitting and wildly oscillating solutions, some kind of addi- 
tional information or regularisation is required. The regularisation 
is achieved by an additional function S(h) which penalises 'rough- 
ness' in the image. The choice of S(h) is determined by what ex- 
actly one considers as roughness. A compromise between the good- 
ness of fit % 2 (h) and the regularisation S(h) can then be found by 
minimising the function 



F{h) = \ 1 2 (h)-uS{h), 



(14) 



where a is a Lagrange multiplier which determines the degree of 
smoothing. As a is varied, solutions lie on a trade-off curve be- 
tween optimal fit to the data and maximal smoothness (see, for ex- 
ample, Titterington 1985). 

4.3 Bayes' theorem 

The previous analysis can be more coherently expressed in a 
Bayesian framework. Bayes' theorem can be used as a starting 
point to draw statistical inferences from data. It states that the con- 
ditional probability Pr(h\d) for a hypothesis h to be true given some 
datarf, the so-called 'posterior' probability, is given by 



Pr(h\d) : 



Pr(rf|/i)Pr(/i) 
Pi(d) ' 



(15) 



The probability Pr(d\h) is called the likelihood of the data, and 
Pr(h) is the prior probability of the hypothesis. It can be used to 
incorporate our prior beliefs or expectations on possible solutions. 
The probability Pi(d), the evidence, only depends on the data and 
can, for the time being, be viewed as a normalisation constant. 

For Gaussian distributed errors on the data points, the likeli- 
hood is then given by 

L(d\h)=Pr(d\h) = - ^1 -— e-iCtt-O'N-^m^), (i 6 ) 



(271)^/2, 



where y}(h) = (Rh d) l N 1 (Rh d) is the standard misfit statis- 
tic introduced in J13L If the parameters are well-constrained by the 
data, i.e. the likelihood function is narrow, the posterior probability 
will be sufficiently peaked and the errors on the parameters h will 
be small. Unfortunately, that is usually not the case in image recon- 
struction, where the number of parameters or image pixels is large. 
One then has to make use of the prior Pr(h) to obtain a solution. 

4.4 The maximum entropy method (MEM) 

If the likelihood function does not constrain the parameters suffi- 
ciently, the choice of a prior becomes important. There are many 
possible choices for the prior. The prior is usually of the exponen- 
tial form (e.g. lSkiUinJ"l 989) 



Pi(h) oc exp[aS(/i) 



(17) 



where S(h) is a regularisation function and a is a constant. Assum- 
ing a likelihood function given by 1 1 61 . the posterior probability 
from 1 1 51 then becomes 



Pi (h\d) oc exp 



■aS(h) 



Clearly, the optimal choice of the regularisati on has to reflect 
our knowledge of the expected solution (see lFriedenlll983h . A 
list of commonly used regularisation functions can be found in 
iTitteringtonl ll985l) . These are either functions that are quadratic 
in the hypothesis h or that use some kind of logarithmic entropy. 

An efficient prior is provided by the (Shannon) information 
entropy of the image. Restricting ourselves to images whose pixel 
values are strictly positive, the image can be considered as a posi- 
tive additive distribution (PAD). The entropy function Q is a gen- 
eralisation of the Shannon entropy. The measure m is often called 
the 'model', because the entropy is maximised by the default so- 
lution h, = m, (i = 1.. .N/,). Given an entropy function S(h), the 
posterior probability can be maximised by minimising its negative 
logarithm 

-]n[Pt(h\d)]=F(h) = l -x 2 (h)-aSih). (19) 

This is the maximum en tropy m ethod (MEM; see, for example, 
lGulll 989: Skilli nsll989HGull & Skillingll999h . 

In {jj, we have omitted a constant additive term that comes 
from the normalisation of the posterior probability. This term in- 
cludes the logarithm of the evidence Pr(rf). The evidence becomes 
useful because it is conditional on the underlying assumptions, for 
example the values of the constant a and the model m, and can be 
written as Pr(rf|a,m, . . .). From Bayes' theorem i 1 5 i . the posterior 
probability Pr(oc,m, . . . \d) for the model m and regularisation a de- 
pends on the evidence Pr(d\a,m, . . .): 

Pi(d\a,m, . . .)Pr(a,m, . . .) 



Prfa.ffj. 



■\d)- 



Pr(rf) 



(20) 



(18) 



Thus, in the same way as the likelihood discriminates between hy- 
potheses the evidence helps to discriminate between different pri- 
ors or models (again, e.g. iGull & SMlinel !l999), In Section l4~5l 
the evidence will be used to set a Bayesian value of the regulari- 
sation constant a. In image reconstructions the model m is often 
set uniformly across the image, to the value of the expected image 
background. Again, for a more refined analysis the evidence can be 
used to discriminate between models. 



4.5 The regularisation parameter a 

The parameter a introduced in <17> determines the amount of reg- 
ularisation on the image. It is clear that minimising % 2 only (by 
setting a = 0) would lead to a closer agreement with the data, and 
thus to noise-fitting. On the other hand, maximising the entropy 
alone by setting oc = °° would lead to an image which equals the 
default everywhere. Indeed, for every choice of a there is an im- 
age h (a) corresponding to the minimum of F(h) for that particular 
choice. The images h (a) vary along a trade-off curve as a is varied. 

There are several methods for assigning an optimal value to 
a. In early MEM applications, oc was chosen such that for the fi- 
nal reconstruction h the misfit statistic % 2 equalled its expectation 
value % 2 (h) = Nj, i.e. the number of data values. This choice 
is often referred to as historic MEM. It can be shown t hat it leads 
to systematic underfitting of the data l Titterington 1985). However, 
an optimum value of a can be ass igned within the Bayesian frame- 
work itself JGulJll989t IGull & Skillmg||l999h . Treating a as an- 
other parameter in the hypothesis space, or rather as a part of the 
model or theory that the parameter h belongs to, one can remove 
the dependence of the posterior on this parameter by marginalising 
over a: 

Pi(h\d)= J Pr(A|d,a)Pr(a|d) da. 
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Figure 5. The relationship between data, visible and hidden space. 



From H3, we have Pr(a|rf) °c Pr(rf|a)Pr(a). If the evi- 
dence Pr(rf|a) is sufficiently peaked, it will overwhelm any priors 
on a and one can simply use the optimal value a which maximises 
the evidenc e. This choice of a is called classic maximum entropy 
JGu11| 1989). It can be shown that the Bayesian value of a may be 
reasonably approximated by choosing its value such that the value 
of F(h) at its minimum is equal to half the number of data points, 
i.e. F^N d /2 <MacKavll992l) . 



4.6 Errors on the maximum entropy estimates 

There are two principal methods of quantifying errors on maximum 
entropy reconstructions. Firstly, one may evaluate the Hessian ma- 
trix H = V/,V|,F at the optimal h, from which the covariance ma- 
trix of the parameters is given by C = H ~~ 1 . However, this typically 
requires the inversion of a large matrix, which is unfortunately non- 
diagonal and so requires a large computational effort. Secondly, one 
may take 'samples' from the posterior probability distribution and 
quantify the error on any particular parameter hi (or combinations 
of parameters) by examining how hi varies between samples from 
the distribution l Gull & Skillins 1999). Additionally, when testing 
the method on simulations, it is always possible to quantify the er- 
rors on the reconstruction by using a Monte-Carlo approach. Since 
this is in fact the most robust method, we adopt it in Section^ in 
which we analyse some simulated data. 



4.7 The intrinsic correlation function (ICF) 

One of the fundamental axioms on which maximum entropy is 
based is that it should not by itself introduce correlations in the re- 
constructions l Gull & S killing 1999). However, it is often the case 
that a priori there is reason to believe that for a specific application 
the reconstructed quantities are not uncorrelated. Fortunately, any 
additional information on the correlation structure can be exploited 
to improve the quality of the reconstruction. 

Denoting the reconstructed image by v instead of h, the cor- 
relation structure of v can be encoded in a function K called the 
intrinsic correlation function (ICF). Using the ICF K, the recon- 
structed ( 'visible ') image v can be derived from a set of new ( 'hid- 
den ') parameters h by 



v = Kft. 



(21) 



Instead of reconstructing the image vector directly, the entropy is 
maximised with respect to the 'hidden image ' h. The elements of h 
are ideally a priori uncorrelated and of unit variance; the informa- 
tion on the correlations is absorbed into K. The relationship be- 
tween data, visible and hidden image is depicted in Fig. [5] The vis- 
ible reconstruction can be derived from the hidden image through 
the ICF, and the predicted noiseless data can be calculated using 
the response matrix and is given by d = Rv. 
If the covariance matrix C = (v/vj) is 
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known, the ICF can be 



constructed straightforwardly by a diagonalisation or Cholesky de- 
composition of C. Unfortunately the correlation structure is usually 
not known in advance, and a suitable ICF has to be found from em- 
pirical or heuristic criteria. 



5 TRANSFORMING THE INVERSE PROBLEM 

Without knowledge of the correlation structure, the ICF will have 
to be based on some assumptions about the correlations. One may 
prefer to find an ICF that works specifically well in certain cases, 
or one that performs well in the worst case. In the following, we 
will investigate several types of wavelet transforms as ICFs. 

Let us consider more closely the relationship between the data, 
visible and hidden image vectors. The N, D -dimensional data vec- 
tor d is an element of the data space £> . Similarly, the visible vector 
veV, dim V = N v and the hidden vector h e H , dim = . 
The data vector is then given by the transform 



(22) 



where the ICF K is a wavelet transform. In practice, we will use 
the transpose K = W' of the wavelet transform in this step, i.e. 
v = \N l h. For orthogonal wavelet transforms, W l = W 1 and this 
choice of K ensures that the transformation i» i— > h is given by 
the wavelet transform and the hidden space simply consists of 
the wavelet coefficients of the reconstruction. For non-orthogonal 
transforms, however, W l = W 1 does not hold. The % 2 (v)-function 
is defined on the space V of visible vectors v and the entropy func- 
tion S(h) on 9i . There are now two ways to construct a MEM al- 
gorithm. 

(i) The first method is to choose an entropy function S(h) in 
the space X of hidden images and to maximise the functional 
F(h) = jX 2 (K/«) — aS(h). In the following, we will call this ap- 
proach 'wavelet MEM'. Fig. [6| shows a schematic depiction of 
wavelet MEM. It is the straightforward way to implement the MEM 
in a hidden-space algorithm. A hidden-space ME M kernel is, for 
example, given in the software package MEMSYS5 iGull & Sta lling 
1999). We also note that a numerical implementation requires the 
evaluation of derivatives of F. These are discussed in Appendix 
C and, from IC1I . we see that the transpose of K is required for 
the evaluation of the gradient. This is why K = W' is a better 
choice than K = W 1 for non-orthogonal transforms. Furthermore, 
in Section l6~2l we will show that the use of the transpose has a very 
simple interpretation for the non-orthogonal a trous transform. 

(ii) Another alternative is to define F in terms of the visible 
space quantities: F(v) = j% 2 (v) — a5(K'v). (Since we defined 
K = W l , the transform applied to the vector v in the entropy term 
is simply the wa velet transform .) This method has bee n purs ued 
by, for instance,|Pantin & Starck 1 19960 and lStarck et alj<200l|) . In 
fact, the entropy S(Wv) can be viewed as a function on V , and its 
proponents call it the 'multiresolution entropy'. We will call this 
approach 'wavelet regularised MEM' . An illustration is shown in 
Fig. Q The transform K is not an ICF in the same sense as dis- 
cussed in Section 14771 since the parameters that are reconstructed 
by the MEM, the visible image pixels, are still correlated. We also 
note that for redundant wavelet transforms, this approach has the 
advantage that the numerical minimisation problem is lower dimen- 
sional, since < . 



8 Klaus Maisinger, M.P. Hobson and A.N. Lasenby 



regularisation 
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maximise 

Figure 6. Schematic depiction of wavelet MEM. The data are predicted 
by a composition of the transpose of the wavelet transform W 1 and the in- 
strumental response R. The entropic regularisation is applied to the hidden 
(wavelet) space, and the posterior functional is maximised with respect to 
the wavelet coefficients. 



'wavelet entropy 
regularisation 




maximise 

Figure 7. Schematic depiction of wavelet regularised MEM. The entropic 
regularisation is applied to the wavelet coefficients obtained from the im- 
age, but the posterior functional is maximised with respect to the image 
pixels. The wavelet transform and entropy functional act as a new effective 
'wavelet entropy'. 



Despite the use of the same linear transform K, 'wavelet regu- 
larised MEM' is not entirely equivalent to 'wavelet MEM' because 
of the non-linearity of the entropy functional. In Appendix IDI we 
show that both methods become equivalent for orthogonal wavelet 
transforms if a quadratic regularisation function is used. 

The type of wavelet transform is not the only free parameter 
one has to pick in wavelet MEM. For instance, there is still free- 
dom in the choice of the entropy functional 5. As mentioned be- 
fore, an optimal choice should reflect the expected distribution of 
the hidden image vectors h. A more detailed discussion of entropy 
functionals used with wavelets and maximum entropy is given in 
SectionlBTl 

There is no reason in the MEM that the data space <D and vis- 
ible space V be iden tical. For example, Bridl e~etal] 1 19981) and 
iMarshall et alj J2002) have applied maximum entropy to the re- 
construction of lensing mass profiles of galaxy clusters from shear 
or magnification data. In this case, © and V do not even share 
the same physical dimensions. However, in image reconstruction 
the data image and the reconstruction are often not only given in 
the same physical units, but on the same discretised grid, and we 
have D = <V . In this case, it is possible to incorporate information 
gleaned directly from the data into the choice of the regularisation. 
The ICF K(rf) and the entropy S(h,d) may become explicitly data- 
dependent. In practice, the data dependence is usually introduced 
by a modification of the model m. 



5.1 The entropy function 

From Q, for a positive additive distribution h, the cross-entropy 
S+{h,m) of the image h with some model m of the image is given 
by the sum 



where 



s + (h,m) = h — m — h\og 



h 



(23) 



The function s + (h,m) reaches a global maximum of zero at h = m. 
Thus, in the absence of data, the reconstruction takes on the default 
value m for all pixels; in practice, the value of m is often set to the 
level of the expected image background. 

Many astronomical images, such as maps of CMB fluctua- 
tions, generally consist of both positive and negative pixels. Fur- 
thermore, even if an image were purely positive, some of its wavelet 
coefficients may be negative. The entropy <23t is thus inapplicable 
to such images. A simple generalisation of 1231 to negative values 
is 



s 



(h,m) 



-m-\h\log — , 
m 



(24) 



which has been used by P antin& Starckl fi 996). This function does 
not have a unique maximum, since both h = ±|m| maximise the en- 
tropy. For h — ► 0, one finds su(h) — > — m. However, the entropy is 
not defined at zero, which is generally the expected default state of 
the image if the mean has been subtracted. In practice, the model 
values m have to be close to zero and small compared to any realis- 
tic data in order to avoid the introduction of a spurious background 
signal. Pantin & Starck 1 1996) use the value m = ka, where a is the 
rms of the data and k = is an arbitrarily chosen, small constant. 

The proper way to extend the entropy 1231 to images that 
take both positive and negative values is the e ntropy definition 
jHobson & Lasenb^l998tlGull & Skillin Jl999l) 



s±(h,m) = x f — 2m — h\og 



2m ' 



(25) 



where *P = \fh 2 + 4m 2 . The entropy s± has a maximum at h = 0. 
In the positive/negative entropy 1251 the role of the model m is dif- 
ferent from that in 1231 . The value of m determines the width of the 
entropy function and thus controls the magnitude of the allowed de- 
viations from the default value. Hence the model can be considered 
as a level of 'damping' imposed on the image. From a dimensional 
analysis, the obvious choice for m is to set it to the expected signal 
rms. If data and visible space are identical and the signal-to-noise 
ratio is sufficiently high, the rms of the observed data can be a good 
approximation to that of the signal. 

Fig. |U shows the entropy functions su and s±. From the 
definitions 1241 and I25i . we see that, for a given model, both 
functions differ only by a constant offset in the limit of large h: 
— > s±(h) +m [h — > oo). However, in order to minimise the 
cusp of Su around zero, the models will generally be chosen differ- 
ently, and the maximum of su will be significantly narrower than 
that of s± . 
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Figure 8. The entropy function s± (h) = A|f — 2m — /llog (solid line), its 
quadratic approximation s(h) = —h 2 / (4m) (short dashed line) and Su (h) = 

\h\ — m — \h\ log J-gL (long dashed line). In all cases, the model was chosen to 
be m = 1. The function su is not defined at h = 0. 
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one obtains the quadratic approximation 

h 2 



s 2 (h) = - 



Am 1 



(26) 



(27) 



which is plotted in Fig. [8] With this entropy, MEM reduces to a 
scaled least squares. The quadratic entropy can also be obtained 
from the large-a limit h~mof the standard entropy s+ (h), but with 
a different scale factor . In a MEM reconstruction, one evaluates 
the product ocS of the entropy S and a regularisation constant a. The 
constant a is not dimensionless; its dimension [a] = l/[h] is given 
by the dimension [h] of h. If the model is chosen proportional to the 
signal rms o"s, then, from 1261 . the product as becomes invariant 
under a rescaling of h if a is also rescaled by 1 /o"s. In fact, to first 
order, any change in the model m can be absorbed by a reciprocal 
change in the regularisation constant a: 



a 9 

as oc h . 



(28) 



This explains why s\.\ produces reconstructions similar to those 
obtained for to s± despite of its narr ow shape for small models. 
We note that in a more recent work, IStarck et alj 1200 ll) use the 
quadratic approximation . 

Finally we note that Pa ntin & Starcklil99 6) propose to choose 
the regularisation parameter a dependent on the scale or even the 
image pixel, instead of setting a constant a globally. This can be 
achieved by introducing an additional weighting factor oc, for each 
pixel i: 



S(h,m) = J^ai s(hi,mi) 



(29) 



However, from 1281 we see that this modification is again to first 
order equivalent to a corresponding change in the model value m;, 
and so can be achieved by adopting a non-constant model. 



5.2 Choosing the parameters a and m 

The choice of the regularisation parameter a has been discussed 
in Section l4~5l Classic maximum entropy uses a Bayesian choice 
of a, as implemented in the software package MEMSYS5. For 
wavelet regularised MEM, the maximisation takes place in visi- 
ble space and the Hessian of the effective entropy is not diago- 
nal, which me ans it cannot be implem ented in MEMSYS5. As men- 
tioned above. IFantin & Starckl £l996) propose to choose a pixel- 
dependent value of a in order to enhance or alleviate the regulari- 
sation on certain coefficients. They use 



a,ocrj^(l-M ; ; 



(30) 



in 1291 . where M,- is the multiresolution support i 1 H defined in Sec- 
tion|5] the ith pixel is assumed to lie in the jth wavelet domain and 
the factor o"^ is the noise dispersion in the jth wavelet domain. 
The factor (1 — Mi) reduces the regularisation of coefficients with 
a good signal-to-noise, and the factor introduces an additional 
pixel-dependent regularisation. 

The pixel-dependent factor a,- is scaled by the global param- 
eter a. In the following, we will simply employ the historic max- 
imum entropy criterion % 2 = N v to fix a global a in cases where 
MEMSYS5 cannot be used to determine a Bayesian value. 

In real-space MEM, there is often no a priori reason to assign 
a varying model to particular image regions, since in the absence 
of any additional information one would expect a constant signal 
dispersion across the image. The wavelet transform, however, is 
designed to enhance the signal-to-noise ratio on some wavelet coef- 
ficients to allow a sparse representation of the signal. Consequently 
one would expect different signal dispersions in each wavelet do- 
main, and a uniform choice of the model seems appropriate only 
within domains. If data and image space are identical, the signal 
dispersion in the jth wavelet domain a J s can be estimated from the 
dispersion of the data o"p and of the noise : 



i2 



Now the model can be set to m, = a{, where the ith pixel lies in 
the jth wavelet domain. If no analytic noise model is available, the 
noise dispersion can be obtained from Monte-Carlo simulations. 

Other models have been proposed that involve the use of 
signal-to-noise ratios rather than signal dispersions. For orthogo- 
nal transforms, the dispersion of the wavelet coefficients of white 
noise is scale-invariant and the signal-to-noise is proportional to the 
signal dispersions. 



5.3 Choice of the wavelet basis 

For a given data set, one has to chose a specific wavelet basis 
for a MEM reconstruction. Some bases will obviously be more 
suitable to fit the data than others, and one naturally wants to 
find a way to choose the best one. A common selection crite- 
rion is the information entropy of the image in the wavelet repre- 
sentatio n jCoif man & Wickerhause3ll992tTzhuang & Baraslfl994t 
lHobsonetaLH l999). As will be discussed in Section |7~T1 we find 
empirically that for orthogonal wavelets, the choice of the wavelet 
basis in the MEM algorithm does not play an important role. Only 
Haar wavelets seem to perform significantly worse than other types. 
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Figure 9. The multi-channel ICF. The hidden image consists of several im- 
ages or channels (left). The visible channels are obtained by convolutions of 
the hidden parameters with point spread functions of different widths (mid- 
dle). The visible image is a weighted sum of the image channels (right). 



6 MULTI-RESOLUTION RECONSTRUCTION 

The advantage of transforming the inverse problem to wavelet 
space is that it allows for a natural multiresolution description of 
the image. Indeed, our approach may be b e cons idered in the con- 
text of the multi-channel ICF proposed bv lWeiJ ll992l) . which we 
now discuss. 
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Figure 10. (a) The operation of a 3-level a trous transform on an image, 
(b) The transpose of the a trous transform operates on a wavelet vector with 
3 scales. Each scale is convolved with the corresponding wavelet, and the 
resulting vector is obtained by a sum over all scales. 



6.1 Multi-channel ICF 

Traditionally the ICF K is taken to be a convolution with some 
point spread function P(x). Thus the visible image is a blurred ver- 
sion of the hidden variables. Pop ular point spread functions com- 
prise B-s plines of var ious orders (Gull & Skillina 1999) or Gaus- 
sians (e.g. Weir 1992). One deficiency of the 'blurring' ICF is that 
the width of the point spread function introduces a characteristic 
scale for the pixel correlations. In most applications, however, ob- 
jects and correlations of varying sizes and scales are present in the 
same image. A straightforward generalisation of the blurring ICF 
is the multi-channel ICF l Weir 1992). This method is illustrated 
in Fig. [5] The hidden variables consist of a number of different 
images or channels hj. Each hidden channel is blurred with a dif- 
ferent point spread function K,- such that v, = K,7i,. The visible 
image v is obtained by a weighted sum v = w,v; of all blurred 
image channels v,, where the weight on the kth channel is denoted 
by Wj. The entropy function is applied to the hidden variables. If 
the weights on different scales are chosen appropriately, it is en- 
tropically favourable to represent extended structure in the visible 
image by a single or very few coefficient in the hidden domain. 

One weakness of this approach is that it introduces a large 
number of new free parameters, like the widths of the point spread 
functions and the weighting factors of the different channels 
(and a complete new hidden image for each scale). If there is 
a priori knowledge on the expected correlation scales of objects, 
the width of the point spread function can be chosen accordingly. 
However, this will rarely be the case. The reconstructions can be- 
come strongly sensitive to the chosen set of ICF widths, the num- 
ber of channels, the w eights and the models in different channels 
iBontekoe et alj Fl994). In the 'pyramidal MEM' iBontek oeet alj 
1994), the different channels contain different numbers of hidden 
parameters. With decreasing resolution, the number of parameters 
in each consecutive channel is reduced by a factor of a half in each 
image dimension. Bontekoe et al. 1 1994) find empirically that uni- 
form models and weights (i.e. = 1 for all channels k) and the 
same point spread function (scaled by factors of 2) can be used for 
all channels. They also note that the pyramid images can be inter- 
preted as spatial band-pass filters. 



We note out that the weights of the visible image channels 
play a similar role as the scale-dependent regularisation parame- 
ters a , from Section I5TT1 that were proposed by IPantin & Starckl 
1 1996). The different weights correspond to a rescaling of the hid- 
den image channels. From 1281 . this is again to first order equivalent 
to a corresponding rescaling (albeit quadratically) of the regulari- 
sation a or the model m. 

6.2 The a trous transform as ICF 

If one writes the non-orthogonal a trous transform as a series 
of convolutions with scaling or wavelet functions, as discussed 
in Section l2~3l one can show that, for the a trous transform, the 
ICF K = W l used in wavelet MEM is just a special case of a multi- 
channel ICF. 

From |9j, the a trous wavelet coefficients wj at the jth scale 
are given by the (discretised) convolution 

4 Z w lfi l>i ,/*. (3D 

k k 

of the original function or image / with a version of the wavelet 
at the 7th scale that was mirrored at the origin. For a symmet- 
ric wavelet function, this corresponds to a convolution with the 
wavelet itself. Examples of wavelet functions are shown in Figs. [2] 
and [3] (For an orthogonal wavelet the equivalent to 13 li would be 
w i = T,k Vk-lihk, where the factor of 2 in the index accounts for the 
decimation carried out between consecutive filter steps.) The trans- 
pose of the a trous transform operates on the wavelet coefficients 
and produces a new image g, which is given by 

j,i J-' 

Thus the transpose consists of a convolution of the 7th wavelet do- 
main with the corresponding wavelet \\r } and a subsequent summa- 
tion over all scales. The operation of the a trous transform and of 
its transpose is illustrated in Fig. 1101 

The effect of the transpose of the a trous transform is just the 
same as that of a multi-channel ICF. Its set of ICFs consists of the 
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Figure 11. The original image (a) used in the simulations described in the text and the 'data image' (b) obtained after a convolution with a Gaussian point 
spread function with a FWHM of 5 pixels and the addition of Gaussian random noise whose rms is half the rms of the original image. 



scaling function and the rescaled wavelet functions. Unlike in the 
standard multi-channel approach, the blurring functions take nega- 
tive values in some areas. As in Section|6] one can introduce dif- « 
ferent weights for all channels in the a trous transform, or one can 
simply adapt the default model on different scales to enhance the 
reconstruction quality. However, if one introduces scale-dependent % - 

weights Wfr, the inverse transform also needs to be rescaled. The 
nice property that the Fourier transforms of all wavelet functions 
add up to 1 is lost. S - 



7 APPLICATION TO SIMULATED DATA 

In this section, we compare the different maximum entropy meth- 
ods discussed in this paper by applying them to a set of simulated 
two-dimensional images. One of the problems of assessing the ca- 
pabilities of different reconstruction methods is that there is no 
single unique criterion for the quality of a reconstruction. Further- 
more, the outcome of any quality measurement depends on a large 
number of variables, such as the type and content of the image, 
properties of the instrument with which the data were observed (e.g. 
the point spread function and the noise) and the model and regular- 
isation constant used in the maximum entropy algorithms etc. In 
this section, we use the rms differences between the original image 
and the reconstruction as a measure of the reconstruction quality. 
Of course, in real applications one does not have the original im- 
age available for comparison, but nevertheless one would usually 
hope to minimise the expected difference between true and recon- 
structed image. In Section lTTl we use a photographic image as an 
arbitrarily chosen, but fairly general and typical test case for image 
reconstructions. In Section lTTl we then turn to the investigation of 
the reconstruction of CMB maps that are realisations obtained from 
an inflationary CDM model, and provide a useful astronomical test 
image that is complementary in its properties to the photographic 
image. 



o I 1 1 1 I 1 

12 3 4 

Figure 12. The dispersion of the wavelet coefficients of data (solid line), the 
signal (long-dashed line) and noise contribution (short-dashed line) of the 
image in Fig.[H](b). There are four levels in the a trous transform, where 
the level 1 corresponds to the coarse structure and level 4 is the domain with 
the most detailed structure. 



7.1 A photographic image 

As a first test image we use the photographic portrait shown in 
Fie. 1111 (a). The image contains 256 x 256 pixels and has had its 
mean subtracted and has been rescaled to an rms of 1. It is highly 
non-gaussian, and its sharp lines and contrasting continuous ex- 
tended regions provide useful characteristics to assess the visual 
impression of different reconstruction methods. We simulate obser- 
vations of this image by convolving it with different Gaussian point 
spread functions with FWHMs of 3, 5 and 10 pixels and adding 
Gaussian white noise with an rms of 0.1, 0.5 or 2. Bearing in mind 
that the original image had unity rms, this corresponds to signal- 
to-noise ratios of 10, 2 and 0.5. For each FWHM and noise level, 
we use Monte-Carlo simulations of 15 different noise realisations. 
One of the realisations obtained for a FWHM of 5 pixels and a 
noise level of 0.5 is shown in Fig. II 1K b). 

To each simulation, we apply several different reconstruction 
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Figure 13. The 4-level a trous transform of the image in Fig.EI(a). Each 
level is plotted on a different greyscale. 



algorithms. First, we use a real-space MEM algorithm whose image 
model is set to be constant across the image plane to the value of 
the estimated signal rms. The regularisation parameter a discussed 
in Section l4~5l is chosen from the historic MEM criterion which 
demands that the % 2 -statistic for the final reconstruction equal the 
number N v of data points, in this case 256 x 256. We also ap- 
ply wavelet regularised MEM and wavelet MEM implementations, 
each with Daubechies-4 tensor, Daubechies-4 MRA and a trous 
wavelets. The a trous wavelets use the triangle function J 101 with 
four levels. In all cases, the model is chosen to be constant within a 
given wavelet domain. The model values for each domain or scale 
are found by setting them to the expected signal dispersions as dis- 
cussed in Section l5~2l Fig. H2| plots data, signal and noise disper- 
sions for the 4-level a trous transform. One can clearly see that on 
coarse scales (level 1) the data are dominated by the signal and on 
the finest scales (level 4) by the noise. The 4 levels of the a trous 
transform are shown in Fig. 1131 Each level is plotted on its own 
greyscale. The detail levels would be much fainter if they were plot- 
ted on the same scale. 

Some of the reconstructions produced from the 'data' in 
Fig. II II (b) are presented in Fig. 1141 The first image (a) shows a 
reconstruction obtained with the wavelet regularised MEM using 
tensor wavelets. The image looks more or less smooth. However, 
there are some faint structures along the horizontal and vertical di- 
rections. These are the signatures of the highly non-differentiable 
Daubechies-4 wavelets (compare Fig.0(b) for the one-dimensional 
profile). The second image (b), which was produced with the a trous 
wavelets, is the visually most appealing reconstruction. The image 
is very smooth, but due to the high noise levels on the data there 
is no apparent attempt at a deconvolution of the point spread func- 
tion. In (c) the results from real-space MEM are shown. The image 
is dominated by the 'ringing' or little speckles that are character- 
istic for the real-space method. Visually, this image is clearly the 
poorest reconstruction of the MEM reconstructions. Finally, in (d) 
we show a reconstruction obtained from a SureShrink filter with 
Daubechies-4 tensor wavelets, as introduced in Section|5| As in (a), 



the spiky signatures of the Daubechies-4 wavelets are visible in the 
image. 

The reconstructions for the wavelet MEM using tensor and a 
trous wavelets are shown in Fig.^| They look very similar to the 
results obtained from the wavelet regularised MEM in Fig. 1141 

For a more quantitative comparison of the different methods, 
TableQlists the reconstruction errors for different FWHMs of the 
point spread function (3, 5 and 10 pixels) and different noise levels 
(0.1, 0.5 and 2). The quoted errors are the rms differences between 
the original image and the reconstruction averaged over 15 differ- 
ent noise realisations. The standard deviation of the error values is 
usually much less than 0.01 for low noise values and not more than 
0.03-0.04 in the high-noise case, so the error on the mean (of the 
errors) is expected to be not more than 0.01 in most cases. There 
are several trends apparent from the numbers: 

• For large point spread functions, all methods yield similar 
results (except perhaps for the real-space MEM at low signal-to- 
noise). 

• For high signal-to-noise, the methods also have a similar per- 
formance. 

• For sufficiently narrow point spread functions and poor signal- 
to-noise ratios, real-space MEM performs clearly worse than its 
competitors. 

• There is some indication that the wavelet MEM performs bet- 
ter than the wavelet regularised MEM. 

• Also, there are some hints that a trous wavelets may outper- 
form the tensor wavelets at least for the wavelet MEM, although in 
wavelet regularised MEM they are less able to cope with high noise 
levels. 

• In the low signal-to-noise regime the SureShrink filter matches 
the performance of the wavelet MEM techniques, indicating that 
the errors are entirely dominated by the noise. For low noise levels, 
however, MEM is more effective than the filter, which makes no 
attempt at a deconvolution. 

In the case when the point spread function is narrow and the noise 
dominant, it is most difficult for the MEM algorithms to distinguish 
between signal and noise and the performance differences become 
most prominent. 

Within the wavelet algorithms, there are of course many 
free parameters that may influence the reconstruction errors. We 
have performed simulations with different types of orthogonal 
wavelets and both two-dimensional tensor and MRA transforms. 
There is no significant difference except for the case of the simple 
Haar wavelets, which seem to perform slightly worse than other 
wavelets. Likewise, for the a trous algorithm the triangle function 
seems to be similarly efficient as the S3 -spline, and the number 
of levels in the transform does have marginal effects as long it 
is greater than a minimum of three or four. We have also tested 
different models for the wavelet coefficients. Generally, there are 
many models suppressing power on small wavelet scales that pro- 
vide a similar performance. A stronger relative penalty on small- 
scale structure, for example by using the variance instead of the 
dispersion of the signal, can be advantageous for poor signal-to- 
noise ratios, since most of the fine structure will be noise. We find 
no benefits from methods that attempt a more data-dependent reg- 
ularisation, for instance using the regularisation constant 1301 . 

Tracing back the reconstruction errors to the wavelet do- 
main, it appears that, not surprisingly, the dominant contribution to 
the errors comes from those domains where the signal-to-noise is 
close to unity. The other domains are either perfectly reconstructed 
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Table 1. Reconstruction errors for different methods as a function of the FWHM (in image pixels) of the convolution mask and of the noise level on the data. 
The numbers quote the rms differences between the reconstruction and the original image averaged over a number of noise realisations as described in the text. 
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Figure 15. Reconstructions of the simulated image from Fig. II ll with 'wavelet MEM': (a) using Daubechies-4 tensor wavelets; (b) using a trous wavelets. 
Compare the corresponding reconstructions obtained from 'wavelet regularised MEM' shown in Figs. 1141 (a) and (b). 
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Figure 16. The mis reconstruction errors as a function of the regularisation 
constant a, for a simulation with a 5-pixel FWHM blurring function and 
a signal-to-noise of 2. For real-space MEM (asterisks) and wavelet MEM 
using tensor (diagonal crosses) and a trous wavelets (perpendicular crosses), 
the plotted points show the final errors for a reconstruction with a fixed 
value of a. For each method, the small lines indicate the value of a that 
would have been obtained from the historic MEM criterion % 2 = ■ 



(the coarse structure) or contribute little to the image (the noise- 
dominated small-scale structure). 



7.1.1 The influence of the regularisation constant a. 

Beside the choice of the model, the selection of the regularisation 
constant a is one of the major problems in the maximum entropy 
method, as already discussed in Sections 14.51 and l5~2l By setting 
the % 2 -statistic to the number of data points, % 2 = iV® , one essen- 
tially fixes the 'goodness of fit', which may result in very similar 
error levels on the reconstructions independent of the basis func- 
tions or regularisation. Fie. ll6l illustrates the dependence of the re- 



construction error on the regularisation constant a for real-space 
MEM and wavelet MEM using tensor and a trous wavelet. These 
results are obtained from a simulation with a FWHM of 5 pixels 
and a noise level of 0.5. Each point corresponds to the rms differ- 
ence between the final reconstructions and the original image when 
the regularisation parameter a has been fixed to the given value. 
One can clearly see that there is a trade-off curve between too close 
a fit to the data (and thus to spurious noise) and too strong a reg- 
ularisation (and thus to suppression of real structure in the image). 
Both extremes result in high reconstruction errors. For each recon- 
struction method, a short horizontal line marks the point along the 
curve that is preferred by the % 2 = iVj, criterion. Two features of 
these curves are worth pointing out: 

• The global minima of the curves are indeed lower for the 
wavelet methods than for real-space MEM. This means that the 
lower errors of the wavelet methods in TableQare not merely arte- 
facts of a poorly chosen regularisation constant. On the contrary, for 
the real-space algorithm the historic MEM criterion actually picks 
out points that are very close to the global minimum of the curve, 
despite the visual ringing evident from the image. 

• The curves appear to be less narrowly peaked for the wavelet 
methods than for real-space MEM. This means the quality of the 
reconstruction will be less sensitive to a. For the wavelet methods, 
one can vary a by about an order of magnitude around the mini- 
mum without significantly affecting the reconstruction errors. 



7.1.2 A Bayesian choice of the regularisation constant a 

The proper Bayesian way to determine the regularisation constant a 
is to treat it as an additional model parameter and marginalise the 
posterior probability over a, i.e. integrate over all possible val- 
ues of a (see Section l4~5l . Because the evidence is often strongly 
peaked, it is usually sufficient to maximise the evidence to find 
the optimal value of a. The reconstruction errors for MEM with 
a Bayesian a (using MEMSYS5) are shown in Table |2| It is evi- 
dent that for the real-space MEM the reconstruction errors are very 
poor compared to those obtained with the historic MEM criterion 
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Table 3. The logarithm lnPr(rf) of the evidence Pr(rf) for a classic MEM reconstruction with a Bayesian choice of the regularisation constant a (classic MEM). 
The logarithms are normalised such that they equal zero for the real-space MEM for each dataset. 



in Table Q This is expected from the curve in Fig. 1161 the his- 
toric MEM criterion picks a value of a that is close to the mini- 
mum of the trade-off curve. The reconstruction errors can only get 
worse for a different choice of a. For comparison with Fig. 1161 
the Bayesian value for the real-space MEM is a = 0.12. In fact, it 
is this poor performance of the real-space method that historically 
lead to the introduction of ICFs and the search for a better image 
model (e.g. MacKav 1992). The wavelet MEM improves the re- 
construction errors dramatically, especially in the case of the tensor 
wavelets (for comparison with Fig. O a = 0.5). The errors even 
approach those obtained for the a trous wavelet MEM in Table Q 
The a trous transform (a = 0.04 in Fig |16l has problems for narrow 
point spread functions; it tends to allow too much image structure 
on small scales. However, by reweighting the individual scales of 
the transform suitably, the reconstructions can be much improved. 
The optimal weighting factors will be investigated in future work. 

The last row of Table |2| shows the reconstruction errors ob- 
tained from a Gaussian multi-channel ICF with four channels. 
The widths of the Gaussians are scaled by a factor of 2 between 
channels, and the weights are chosen such that the volume of the 
ICFs are identical between levels. Even though the weights are not 
specifically adapted to the data, the reconstruction errors are much 
lower than for real-space MEM. Again, the reconstructions can be 
much improved by the choice of different convolution masks and 
weightings. 

The evidence for classic real-space and wavelet MEM are pre- 
sented in Table[3] The values are the logarithms lnPr(rf) of the ev- 
idence Pr(d) as introduced in Section|4] The values have been nor- 
malised such that for a given FWHM and noise rms the evidence for 
the real-space MEM equals 1. The values presented here thus allow 
a comparison of the evidence for a model with ICF and the simple 
real-space MEM without ICF. The standard deviation across the 
different noise realisations is roughly 170 (in units of lnPr(rf)). It is 
apparent that the evidence for the ICF MEM and wavelet MEM is 
significantly higher than for the real-space MEM, as would be ex- 
pected from the improved reconstruction errors. Furthermore, the 
lower reconstruction residuals for the tensor transform are matched 
by a higher evidence. From a Bayesian point of view, the ICF model 



or the wavelet basis are therefore the favoured models for image re- 
constructions. Unfortunately, a similar calculation of the evidence 
for the wavelet regularised MEM is not possible, because the soft- 
ware package MEMSYS5 that is used for the reconstruction is lim- 
ited to regularisation functionals whose curvature matrix is diago- 
nal. 



7.2 CMB maps 

Having applied the wavelet MEM techniques to a general image, 
we now turn to the reconstruction of CMB maps. We apply the 
different methods to five different CMB maps that are realisations 
obtained from the inflationary cold dark matter (CDM) model. The 
map size is 16° x 16° with 256 x 256 pixels, corresponding to a 
pixel size of 3.75'. The maps are convolved with Gaussian beams 
of different beam sizes with FWHMs of 11.25' (3 pixels), 18.75' 
(5 pixels) and 37.5' (10 pixels) respectively. Gaussian white noise 
is added with different signal-to-noise ratios of Os/^n = 10, 2 and 
0.5. For each signal-to-noise ratio, we create five different noise 
realisations. With the five different input CMB maps, there are thus 
25 different combinations of sky and noise realisations available for 
a given FWHM and noise level. The maps are reconstructed on the 
same grid as the simulated data. Again we use the historic MEM 
criterion to determine the regularisation constant instead of classic 
MEM, since it produces sufficiently good reconstruction errors and 
is also easily applicable to wavelet-regularised MEM. 

The reconstruction errors for different methods are quoted 
in Table [4] Because of the wealth of information present on all 
scales in the CMB maps, the errors are generally larger than for the 
photographic image (compare Table 0. Although the differences 
in reconstruction errors are less conspicuous, the results confirm 
the conclusions drawn from the photographic reconstructions. For 
low signal-to-noise ratios and narrow point spread functions, the 
wavelet-based methods are superior to real-space MEM. However, 
there is no significant difference between wavelet regularised and 
wavelet methods, even though there is still some indication that a 
trous wavelets perform better than orthogonal ones for high signal- 
to-noise ratios. In the last row, we also show that the SureShrink 
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Table 4. Reconstruction errors between reconstructions and original CMB maps. The errors are averaged over a set of 25 simulations with 5 different noise 
and image realisations. 




Figure 17. (a) One realisation of the CMB maps used in the simulations described in the text. The map is normalised to an unity rms, and the axes are labelled 
in degrees, (b) The 'data image' obtained after a convolution with a Gaussian point spread function with a FWHM of 5 pixels and the addition of Gaussian 
random noise whose rms is half the rms of the original image. 



filter again cannot match the performance of maximum entropy for 
high signal-to-noise ratios. 

For illustration, one of the realisations of the CMB maps used 
in the simulations is shown in Fig. 1171 (a). A simulated observa- 
tion of this map using a point spread function with a FWHM of 
5 pixels and Gaussian random noise of 0.5 is shown in Fig. 1171 (b). 
Reconstructions using wavelet MEM are shown in Figs. 1181 (a) for 
Daubechies-4 tensor wavelets and (b) for a trous wavelets. While 
the reconstruction errors are virtually identical in both cases, the 
tensor reconstruction show distinctly non-gaussian features intro- 
duced by the spiky wavelet functions. Real-space MEM reconsruc- 
tions of the same data are plotted in Figs. 1181 (c) and (d). The recon- 
struction (c) was obtained with the historic MEM criterion; recon- 
struction errors are similar to those of the wavelet methods. The 
image (d) was obtained from a Bayesian choice of the regulari- 
sation constant with the classic MEM. The reconstruction quality 
is visibly inferior, which is confirmed by the reconstruction errors 
(0.59 compared to 0.40 for historic MEM). 

The averaged logarithms lnPr(rf) of the Bayesian evi- 
dence Pr(rf) obtained from reconstructions of the CMB maps with 
the classic MEM criterion are shown in Table [5] They confirm the 
results from Table [3] and the visual impression from Fig. 1181 The 
evidences for the wavelet methods are again significantly higher 



than for the real-space MEM, even though the relative evidence ra- 
tios are not quite as pronounced as for the reconstructions of the 
photographic image. 



8 CONCLUSIONS 

Wavelets are functions that enable an efficient representation of sig- 
nals or images. They help to identify and compress the information 
content of the signal into a small number of parameters. Because 
wavelet functions span a whole range of spatial scales, they can 
be used to describe signal correlations of different characteristic 
lengths. In this paper, we have investigated how wavelets can be 
combined with the maximum entropy method to improve the re- 
construction of images from blurred and noisy data. 

There are two principal ways to incorporate wavelets into 
the maximum entropy method. First, the wavelet transform can be 
treated as an intrinsic correlation function that is used to decor- 
relate the data (wavelet MEM). Secondly, the wavelet transform 
can be combined with the entropy functional into a new effec- 
tive wavelet entropy (wavelet regularised entropy). We have im- 
plemented both approaches for orthogonal wavelet transforms. An- 
other type of wavelet transform is the a trous transform. It is non- 
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Table 5. The logarithm lnPr(rf) of the Bayesian evidence Pr(rf) for diffemt reconstructions of the CMB maps. The values are averaged over a set of 25 simu- 
lations with 5 different noise and image realisations and are normalised such that they equal zero for the real-space MEM for each dataset. 



orthogonal and has the benefit of being invariant under translations 
and rotations of the image. We show that the a trous transform can 
be considered as a special case of a multi-channel ICF, in which 
the image is produced from a linear combination of images con- 
volved with point spread functions of different widths. The quality 
of the reconstruction depends on the relative weights assigned to 
each channel or scale. 

We have applied MEM implementations using both orthogo- 



nal and a trous wavelets to simulated observations of CMB tem- 
perature anisotropies. We find that while the relative weighting of 
scales or channels is important, there is a range of different weight- 
ings that can yield roughly similar results as long as they suppress 
small-scale structure in the image. It does not matter much whether 
the weighting is introduced by setting different channel weights 
or by rescaling the entropy expressions or default models. Weight- 
ings that suppress small-scale structure more efficiently can per- 
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form better for low signal-to-noise, while they are worse for high 
signal-to-noise. Furthermore, we also find that for images contain- 
ing structure on different scales, like CMB maps, methods that try 
to improve the reconstruction by ad hoc assignments of pixel- and 
data-dependent weights usually do not enhance the reconstruction 
qu ality. The more c omplicated reconstruction prescriptions given 
bv lPantin & Starckl ll99d) have no benefits for CMB map-making. 

As far as reconstruction errors are concerned, wavelet-based 
maximum entropy algorithms seem to match the standard MEM 
in pixel space (real-space MEM) for large point spread functions 
or low noise levels. There exist sufficient well-determined degrees 
of freedom in the data to make the image basis irrelevant. On 
the other hand, for poor signal-to-noise and narrow convolution 
masks, wavelet methods outperform real-space MEM. Thus the use 
of wavelet techniques can improve the reconstruction of images in 
many cases, while there is no disadvantage of using these methods 
in other situations. The improvement seems to be genuinely related 
to the basis set or ICF and not just an artefact of an improper choice 
of the regularisation. A Bayesian treatment of the regularisation 
constant and a comparison of the different reconstruction methods 
shows a much higher evidence for ICF methods than for the simple 
real-space MEM. In a Bayesian context, the wavelet basis can thus 
be interpreted as a better 'model' for the image. In this regard, it 
fulfills the promise of an improvement over t he real-space m ethod 
that the ICF was designed to address (see e. g. Mac Kavll992[) . 

The isotropic a trous transform or the multi-channel ICF can 
in some cases improve on orthogonal wavelets. Furthermore, they 
can be implemented within the MEMSYS5 maximum entropy ker- 
nel and thus be applied in a proper Bayesian maximisation scheme. 
To summarise, we conclude that the use of wavelets in MEM im- 
age reconstructions is a successful technique that can improve the 
quality of image reconstructions. 
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APPENDIX A: MULTIRESOLUTION ANALYSIS 

Multiresolution analysis provides a simple framework for describ- 
ing the properties of wavelet and scaling functions discussed in 
Section [2] Let £ 2 (R) be the set of square integrable functions. 
The multiresolution analysis is a sequence of closed subspaces 
i^AjsZ C X 2 (R) which approximate L 2 (M). The subspaces are 
nested 

... C C Vo C Vi C... 



MEM image reconstruction using wavelets 1 9 



such that their union UjeZ ^ j i s dense in £ 2 (I 
tion f{x) 



, and for any func- 



f{x)eVj&f(2x)ei> J+ i. 

In this picture, the scaling functions {<j)(x— Oj' £ ^} form an or- 
thonormal basis in the reference space 'P'q. Because V$ C V\, ele- 
ments of <Vq can be written as linear combinations of those of V\ : 



§(x) = Y J \^-H k if{2x-k). 



(Al) 



This expression is called a refinement relation for the scaling func- 
tion §(x). 

Given a subspace V,- and its basis of scaling functions (j) (x) , 
one can ask how this basis in l/i can be extended to a basis in "^j+i ■ 
The wavelet functions \j/(jt) can be introduced as a set of functions 
needed to complete the basis in In other words, they form a 

basis of the orthogonal complement Wj of n/j, i.e. l^j+i = Vj® 
•Wj. Consequently, the wavelet functions can be written in the form 
of a refinement relation 



¥ (*)=£%/2G*<K2x-£). 

k 



(A2) 



The coefficients Gi are related to the H% via G k = (— 1)*//[_/.. In 
a signal processing context, the sequences and are called 
quadrature mirror filters. The sequence is a low pass filter, while 
Gk acts as a high pass filter. 

The wavelet transform represents a function in terms of its 
smooth basis functions <|)o,/ £ and the detail functions V|f; ; £ Wj. 

Al The discrete wavelet transform 

For the discrete wavelet transform introduced in Section 12.21 
wavelet coefficients can be constructed directly from the data vec- 
tor cjj=f(xi) (7=1,..., iV) via 



9-1,* 



= ^H[_2k c j,l, 

I 

= J^ G l-2k c j.l- 



(A3) 



This prescription leads to a pyramidal algorithm for the implemen- 
tation of the discrete wavelet transform. At each iteration, the data 
vector of length 2 J is split into 2 1 detail components and 2 J 
smoothed components. The smoothed components are then used as 
input for the next iteration to reconstruct the detail coefficients at 
the next larger scale. 

A2 The a trous transform 

For the a trous transform from Section l2~3l a data vector Cj ; (/ = 
1, ... ,N, 2 J ^ AO is iteratively smoothed with a filter mask Hi J8}: 



c j-i,k - Y* H i c iW> J -> ] i- 
I 

The coefficients H\ can be derived from a refinement relation of the 
form 

♦(x)=£2fli4>(2*-0, 
/ 

where (|)(x) is a scaling function. Note the difference of a factor \p2 
compared to <AH . The a trous wavelet vector is given by 

w j-l-k = Y, GlC j,k+X J -»l = c JM^ c j-l,k, 
I 



2 2 



Figure Bl. The partitioning of the wavelet image into different domains for 
the tensor transform. There are J xj domains mixing different scales j\ and 
]2. Increasing j corresponds to finer scales. 



from which G/ = 8/.o — H[. For the wavelet \|/(x) one finds the scal- 
ing relation 

=£2Grt(2x-i)=2tf2*)-4>(x). 



APPENDIX B: TWO-DIMENSIONAL WAVELET 
TRANSFORMS 

Bl Tensor products 

The simplest approach to extending the wavelet transform to two 
dimensions uses tensor products of the one -dimensional wavelets 
as a two-dimensional basis (e.g. IPress et allll992h . The resulting 
algorithm is straightforward. First one performs a wavelet trans- 
form on the first index of the image matrix for all possible values 
of the second, and then on the second. The tensor basis functions 
mix one-dimensional wavelets from different scales; they are given 
by 



00.0;/, ,k(x>y) 



Vh,h(x)<bo,i 2 (y), 



The two-dimensional pixelised image has dimensions N x N = 
2 Jl x 2 ]l and by analogy with jSJ one has ^ j\ ^ J\ — 1 and 
^ l\ ^ 2 jl — 1, and similarly for j'2 and 12- If we denote such an 
image by the (N x N-) matrix T, then the matrix of wavelet coeffi- 
cients is given by 



T = W (M) T = W 



(Id) 



T W 



(ld)t 



(Bl) 



where w' 1 ''' is the N x A'-matrix describing the one-dimensional 
transform that was introduced in Q, and W( w ' t is its trans- 
pose. The matrix T is partitioned into J\ x J2 separate domains 
of 2 Ji x 2 j2 wavelet coefficients (see Fig. IB It . according to the 
scale indices j\ and j'2 in the horizontal and vertical directions 
respectively. By analogy with the one-dimensional case, as j\ in- 
creases the wavelets represent the horizontal structure in the im- 
age on increasingly smaller scales. Similarly, as 72 increases the 
wavelets represent the increasingly fine scale vertical structure in 
the image. Thus domains that lie in the leading diagonal (i.e. with 
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Figure B2. The partitioning of the wavelet image into different domains for 
the MRA transform. At each scale j, there are 3 different domains: horizon- 
tal, vertical and diagonal. Again, increasing j corresponds to finer scales. 



j\ — j<i) contain coefficients of two-dimensional wavelets that rep- 
resent the image at the same scale in the horizontal and vertical 
directions, whereas domains with j\ ^ 72 contain coefficients of 
two-dimensional wavelets describing the image on different scales 
in the two directions. 



where d is the data vector and K the ICR The gradient and curvature 
can be derived straightforwardly: 



Va 2 = (RK) l N \RKh -d), 

v h va 2 = k'r'n'rk. 



(CI) 
(C2) 



For an implementation, one thus needs to provide the transfor- 
mation R, the wavelet transform W = K l and their transposes R' 
and W'. 



CI Real-space MEM 

For a maximum entropy algorithm operating on data and recon- 
structed images given in real space, i.e. in the same image plane, 
the data can be predicted by a convolution of the underlying image 
distribution h(x) with a point spread function P(x). The discrete 



predicted data samples df are given by 

df = £*yA(*j) = -Xj)h(Xj) 



(C3) 



The transpose R* of the response matrix is easily obtained. Its op- 
eration on an image vector h, where hj = h(xj) is given by 



Y,P(-Xi+Xj)h(Xj) 



(C4) 



B2 MRA transforms 

The MRA transform uses three different tensor products of one- 
dimensional wavelets. Dilated versions of these tensor products 
form the detail wavelets. They do not mix scales and can be de- 
scribed in terms of a single scale index j. At each scale level j, 
these bases are given by 

The /, wavelet is simply an averaging function at the jlh level, 
while the other three wavelets correspond to structure at the jth 
scale level in the horizontal, vertical^ and diagonal directions in the 
image. The structure of the matrix T from i B 1 1 is different for the 
MRA transform. Obviously, the matrix is square and its domains 
can be described by a single value of J. Fig. IB2I shows how it is 
partitioned into 3(7— 1) detail domains and one domain with the 
smoothed components. 



APPENDIX C: CALCULATION OF DERIVATIVES 

In an implementation of the maximum entropy algorithm, the pos- 
terior functional F(h) = j% 2 (h) — aS(h) has to be minimised nu- 
merically. Some minimisation routines, like the MEMSYS5 package 
(Gull&Skillina 1999), require first derivatives of F with respect to 
the (visible or hidden) image pixels, while others, like the simple 
Newton-Raphson method, additionally require second derivatives. 
The derivatives of the % 2 - and entropy terms can be calculated sep- 
arately. The % 2 -functional for a linear instrumental response ma- 
trix R is given by 

% 2 {h) = (RK/i rf)'N '(RK/i -d), 



This is a correlation of the functions P(x) and h(x), and not a con- 
volution. For a symmetric beam, R = R*. The terms which are 
useful to derive the curvature <C2t . are also straightforward to cal- 
culate. The convolution can be most easily implemented with FFTs. 
By substituting <C3> and <C4t into the expressions JC II and <C2t . 
and setting the ICF to the identity K = 1, one can obtain the deriva- 
tives of the % 2 -functional. 



C2 Wavelet MEM 

Wavelet MEM uses the real-space response matrix derived in Sec- 
tion and an ICF K that is given by the transpose W' of the 
wavelet transform. Derivatives can by obtained by substitution into 
the expressions <CH and JC2t . For the orthogonal wavelet trans- 
forms, the transpose of the wavelet transform is simply the inverse 
transform. The transpose of the a trous transform is given by 03- 
We note that the y} -curvature 

VjVri 2 = WR l N^RW' 

is numerically difficult to evaluate efficiently for the orthogonal 
transforms. However, for a diagonal noise covariance N, Nu = o?, 
the diagonal elements of the curvature are 



2„2 



9 

un n i,k,o,p ° 



pn- 



If the noise covariance is constant across pixels with O; = a, this 
reduces to 

3 2 % 2 2 

W = £ Za W nkRkiRioWt p 5p n , 
un " ikop ° 

which only needs to be evaluated once for each wavelet domain. 
This makes a calculation of the curvature feasible in a reasonable 
amount of time. 
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C3 Wavelet-regularised MEM 

In wavelet-regularised MEM, the data d p are predicted from an 
image h in the same way as in the real-space algorithm from 
Section IC1I The entropy functional, however, is calculated on 
the wavelet coefficients rather than the image pixels. This can be 
viewed as a new entropy functional created from a composition of 
the standard entropy and the wavelet transform. Numerical deriva- 
tives are given by 

V v [5(Wv)] = W l [V ft 5](Wv), 
V„V„[S(Wv)] = W t [V ft V /l 5](Wv)W. 

Again, second derivatives are numerically difficult to evaluate for 
orthogonal wavelet transforms. 



APPENDIX D: EQUIVALENCE OF METHODS 

In this section, we show that the 'wavelet MEM' and 'wavelet reg- 
ularised MEM' introduced in Section [5] become equivalent if the 
regularisation function is quadratic and the wavelet transform or- 
thogonal. 

For wavelet MEM, the x 2 -statistic is given by 

% 2 (h) = (RKh dfN 1 (RKh -d). (Dl) 

The most general form of quadratic regularisation is given by 

S(h)= h l M l h. 

In the special case of the quadratic approximation 1271 to the pos- 
itive/negative entropy with a model m, the matrix M is defined as 
M{j = 4mi$jj. The maximum entropy solution h can be found by 
minimising the function i 141 . 

F(h) = h 2 (h)-aS(h) 

= i(RK/i dfN '(RK/i d) + aJi t M~ 1 h 

= irf'N l d fc'K'R'N l d 

+fc t (^K t R t l\T 1 RK + alVr I )fc. 

Demanding that the gradient of F with respect to h vanishes at the 
maximum h, we obtain 

(jK'R'N^'RK + aM -1 )^ = K'R'INP'rf. (D2) 

For wavelet regularised MEM, we have 

X 2 (v) = (Rv dfN '(Rv -d) and 
S(KS>) = vKM 'K'v. 

The maximum entropy solution v is given by 

(^R'N 'R + aKM 1 K t )P = R'lNl 'rf. (D3) 

For orthogonal wavelet transforms W = K', KK' = 1 = K'K. Mul- 
tiplying <D3l by K', we obtain 

(^K'R'N 'RK + aM 1 )K t P = K'R'N l d. 

By comparison with <D2> . we see that h = K'v. The solutions for 
wavelet MEM and wavelet regularised MEM are identical. 



